AMENDMENTS TO THE CLAIMS 



Claims Pending: 

• At time of the Action: Claims 1, 3-6, 10-19, 28-39 

• Amended Claims: Claims 1, 28, and 36 

• New Claims: Claims 40-44 

• After this Response: Claims 1, 3-6, 10-19, and 28-44 

This listing of claims will replace all prior versions and listings of claims in the application: 

1. (Currently Amended) A method of using a tuning set of information to 
jointly optimize the performance and size of a language model, comprising: 

providing a textual corpus comprising subsets wherein each subset comprises a 
plurality of items; 

creating a Dynamic Order Markov Model data structure by assigning each item of 
the plurality of items to a node in the data structure, wherein the nodes are logically 
coupled to denote dependencies of the items, and calculating a frequency of occurrence for 
each item of the plurality of items; 

segmenting at least a subset of a received textual corpus into segments by 
clustering every N-items of the received corpus into a training unit, wherein resultant 
training units are separated by gaps, and wherein N is an empirically derived value based, 
at least in part, on the size of the received corpus; 

creating the tuning set from application-specific information; 

(a) training a seed model via the tuning set; 

(b) calculating a similarity within a sequence of the training units on either side 
of each of the gaps; 

(c) selecting segment boundaries that maximize intra-segment similarity and 
inter-segment disparity; 

(d) calculating a perplexity value for each segment based on a comparison with 
the seed model; 

(e) selecting some of the segments based on their respective perplexity values 
to augment the tuning set; 

iteratively refining the tuning set and the seed model by repeating steps (a) through 
(e) [[until]] with respect to a threshold; 

and 

refining the language model based on the seed model. 
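
For illustration only, and not as part of the claim language, the iterative loop of steps (a), (d), and (e) can be sketched in Python as follows. The sketch assumes the corpus has already been segmented (steps (b) and (c) are omitted), uses an add-one-smoothed unigram model as a stand-in for the seed model, and uses a maximum tuning-set size as the threshold. The function names `train_unigram`, `perplexity`, and `refine_tuning_set` and the parameters `per_round` and `max_size` are hypothetical.

```python
import math
from collections import Counter

def train_unigram(segments):
    """Step (a): train a unigram model (assumed stand-in for the seed model)."""
    counts = Counter(tok for seg in segments for tok in seg)
    return counts, sum(counts.values())

def perplexity(model, segment, vocab_size):
    """Step (d): perplexity of one segment under the model, with add-one smoothing."""
    counts, total = model
    log_prob = sum(
        math.log((counts[tok] + 1) / (total + vocab_size)) for tok in segment
    )
    return math.exp(-log_prob / len(segment))

def refine_tuning_set(tuning_set, segments, vocab_size, per_round=1, max_size=4):
    """Repeat steps (a), (d), (e) until the tuning set reaches max_size (assumed threshold)."""
    tuning_set = list(tuning_set)
    remaining = list(segments)
    while remaining and len(tuning_set) < max_size:
        seed = train_unigram(tuning_set)
        # Rank the remaining segments from lowest to highest perplexity.
        remaining.sort(key=lambda seg: perplexity(seed, seg, vocab_size))
        # Step (e): move the lowest-perplexity segments into the tuning set.
        tuning_set.extend(remaining[:per_round])
        remaining = remaining[per_round:]
    return tuning_set, train_unigram(tuning_set)
```

In this sketch each round retrains the seed model on the enlarged tuning set before re-ranking, so segments that resemble earlier selections become progressively easier to admit.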

2. (Canceled). 

3. (Original) A method according to claim 1, wherein the tuning set of 
information is comprised of one or more application-specific documents. 

4. (Original) A method according to claim 1, wherein the tuning set of 
information is a highly accurate set of textual information linguistically relevant to, but not 
taken from, the received textual corpus. 

5. (Original) A method according to claim 1, further comprising a training set 
comprised of at least the subset of the received textual corpus. 

6. (Original) A method according to claim 5, further comprising: 

ranking the segments of the training set based, at least in part, on the calculated 
perplexity value for each segment. 

7-9. (Canceled). 

10. (Previously Presented) A method according to claim 1, wherein the 
calculation of the similarity within a sequence of training units defines a cohesion score. 

11. (Original) A method according to claim 10, wherein intra-segment 
similarity is measured by the cohesion score. 

12. (Previously Presented) A method according to claim 10, wherein inter- 
segment disparity is approximated from the cohesion score. 

13. (Original) A method according to claim 1, wherein the calculation of inter- 
segment disparity defines a depth score. 
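
For illustration only, and not as part of the claim language, one way to compute a cohesion score and a depth score is sketched below. The claims do not specify the formulas; this sketch assumes a cosine similarity over item counts for the cohesion score and a peak-to-valley measure for the depth score, in the style of the TextTiling algorithm. The function names and the `window` parameter are hypothetical.

```python
import math
from collections import Counter

def cohesion_scores(units, window=2):
    """Cosine similarity of the `window` training units on either side of each gap."""
    scores = []
    for gap in range(1, len(units)):
        left = Counter(tok for u in units[max(0, gap - window):gap] for tok in u)
        right = Counter(tok for u in units[gap:gap + window] for tok in u)
        dot = sum(left[tok] * right[tok] for tok in left)
        norm = math.sqrt(sum(v * v for v in left.values())) * math.sqrt(
            sum(v * v for v in right.values())
        )
        scores.append(dot / norm if norm else 0.0)
    return scores

def depth_scores(cohesion):
    """Depth of each cohesion valley relative to the nearest peak on each side."""
    depths = []
    for i, score in enumerate(cohesion):
        left = i
        while left > 0 and cohesion[left - 1] >= cohesion[left]:
            left -= 1  # climb to the peak on the left
        right = i
        while right < len(cohesion) - 1 and cohesion[right + 1] >= cohesion[right]:
            right += 1  # climb to the peak on the right
        depths.append((cohesion[left] - score) + (cohesion[right] - score))
    return depths
```

Under these assumptions, a gap with a large depth score sits between two internally cohesive runs of training units, which makes it a natural candidate for a segment boundary.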

14. (Original) A method according to claim 1, wherein the perplexity value is a 
measure of the predictive power of a certain language model to a segment of the received 
corpus. 
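
As a point of reference, and not as part of the claim language, perplexity is conventionally defined as the inverse geometric mean of the probabilities a model assigns to the items of a segment. The notation below is an assumption for illustration: $w_1, \dots, w_M$ are the items of one segment and $P$ is the probability assigned by the language model.

```latex
\mathrm{PP}(w_1,\dots,w_M)
  = P(w_1,\dots,w_M)^{-1/M}
  = \exp\!\left(-\frac{1}{M}\sum_{i=1}^{M}\ln P(w_i \mid w_1,\dots,w_{i-1})\right)
```

A lower value means the model predicts the segment better.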

15. (Original) A method according to claim 1, further comprising: 

ranking the segments of at least the subset of the received corpus based, at least in 
part, on the calculated perplexity value of each segment; and 

updating the tuning set of information with one or more of the segments from at 
least the subset of the received corpus. 

16. (Original) A method according to claim 15, wherein one or more of the 
segments with the lowest perplexity value from at least the subset of the received corpus 
are added to the tuning set. 

17. (Original) A method according to claim 1, further comprising: 

utilizing the refined language model in an application to predict a likelihood of 
another corpus. 

18. (Original) A storage medium comprising a plurality of executable 
instructions including at least a subset of which, when executed, implement a method 
according to claim 1. 

19. (Original) A system comprising: 

a storage medium having stored therein a plurality of executable instructions; and 
an execution unit, coupled to the storage medium, to execute at least a subset of the 
plurality of executable instructions to implement a method according to claim 1. 

20-27. (Canceled). 

28. (Currently Amended) A modeling agent comprising: 
a controller, to receive invocation requests to develop a language model from a 
textual corpus comprising subsets wherein each subset comprises a plurality of items and 
to calculate a frequency of occurrence for each item of the plurality of items; and 
a data structure generator, responsive to the controller to: 

create a Dynamic Order Markov Model data structure by assigning each 
item of the plurality of items to a node in the data structure, wherein the nodes are 
logically coupled to denote dependencies of the items; 

develop a seed model from a tuning set of information; 

segment at least a subset of a received corpus, wherein the segments of the 
received corpus are a clustering of every N items of the received corpus into a 
training unit, wherein N is an empirically derived value based, at least in part, on 
the size of the received corpus, and the training units are separated by gaps; 

calculate the similarity within a sequence of training units on either side of 
each of the gaps; 

select segment boundaries that improve intra-segment similarity and inter- 
segment disparity; 

calculate a perplexity value for each segment; 

refine the seed model with one or more segments of the received corpus 
based, at least in part, on the calculated perplexity values; 

iteratively refine the tuning set with segments ranked by the seed model and 
in turn iteratively update the seed model via the refined tuning set; 

filter the received corpus via the seed model to find low-perplexity 
segments; and 

train the language model via the low-perplexity segments. 

29. (Original) A modeling agent according to claim 28, wherein the tuning set 
is dynamically selected as relevant to the received corpus. 

30. (Original) A modeling agent according to claim 28, the data structure 
generator comprising: 

a dynamic lexicon generation function, to develop an initial lexicon from the tuning 
set, and to update the lexicon with select segments from the received corpus. 

31. (Original) A modeling agent according to claim 28, the data structure 
generator comprising: 

a frequency analysis function, to determine a frequency of occurrence of segments 
within the received corpus. 

32. (Original) A modeling agent according to claim 28, the data structure 
generator comprising: 

a dynamic segmentation function, to iteratively segment the received corpus to 
improve a predictive performance attribute of the modeling agent. 

33. (Original) A modeling agent according to claim 32, wherein the dynamic 
segmentation function iteratively re-segments the received corpus until the language model 
reaches an acceptable threshold. 

34. (Original) A modeling agent according to claim 32, the data structure 
generator further comprising: 

a frequency analysis function, to determine a frequency of occurrence of segments 
within the received corpus. 

35. (Original) A modeling agent according to claim 34, wherein the data 
structure generator selectively removes segments from the data structure that do not meet a 
minimum frequency threshold, and dynamically re-segments the received corpus to 
improve predictive capability while reducing the size of the data structure. 

36. (Currently Amended) A method of jointly optimizing the performance and 
size of a language model, comprising: 

providing a textual corpus comprising subsets wherein each subset comprises a 
plurality of items; 

creating a Dynamic Order Markov Model data structure by assigning each item of 
the plurality of items to a node in the data structure, wherein the nodes are logically 
coupled to denote dependencies of the items, and calculating a frequency of occurrence for 
each item of the plurality of subsets; 

segmenting one or more relatively large language corpora into multiple segments of 
N items, wherein N is an empirically derived value based, at least in part, on the size of the 
received corpus; 

selecting an initial tuning sample of application-specific data, the initial tuning 
sample being relatively small in comparison to the one or more relatively large language 
corpora, wherein the initial tuning sample is used for training a seed model, the seed model 
to be used for ranking the multiple segments from the language corpora; 

iteratively training the seed model to obtain a mature seed model, wherein the 
iterative training proceeds until a threshold is reached, each iteration of the training 
including: 

updating the seed model according to the tuning sample; 
ranking each of the multiple segments according to a perplexity comparison 
with the seed model; 

selecting some of the multiple segments that possess a low perplexity; and 
augmenting the tuning sample with the selected segments; 
once the threshold is reached, filtering the language corpora according to the 
mature seed model to select low-perplexity segments; 

combining data from the low-perplexity segments; and 
training the language model according to the combined data. 

37. (Previously Presented) The method as recited in claim 36, wherein the 
selecting an initial tuning sample comprises selecting a few application-specific 
documents. 

38. (Previously Presented) The method as recited in claim 36, wherein the 
threshold comprises one of a predetermined size of the seed model or a sufficient 
application specificity of the seed model. 

39. (Previously Presented) The method as recited in claim 36, further 
comprising pruning the language model utilizing an entropy based cutoff algorithm that 
uses only information embedded in the language model itself. 

40. (New) The method as recited in claim 1, further comprising pruning the 
Dynamic Order Markov Model data structure. 

41. (New) A modeling agent according to claim 28, wherein the data structure 
generator selectively prunes the Dynamic Order Markov Model data structure. 

42. (New) The method as recited in claim 36, further comprising pruning the 
Dynamic Order Markov Model data structure. 

43. (New) A storage medium comprising a plurality of executable instructions 
including at least a subset of which, when executed, implement a method according to 
claim 36. 

44. (New) A storage medium comprising a plurality of executable instructions 
including at least a subset of which, when executed, implement a method according to 
claim 39. 